Dynamic genome reference generation for improved NGS accuracy and reproducibility

ABSTRACT

A “dynamic” reference is presented that utilizes population level information to improve reference-based alignment to detect novel, deleterious, or functional variants in clinical sequencing applications. An automatically updated database of known genetic variants is provided to a memory connected with an integrated circuit configured to process genetic sequence data using the dynamic reference and reference variants.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to U.S. Non-Provisional Application Ser. No. 14/630,571, filed Feb. 24, 2015, entitled “Dynamic Genome Reference Generation for Improved NGS Accuracy and Reproducibility”, which claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Application Ser. No. 61/943,870, filed Feb. 24, 2014, entitled “Dynamic Genome Reference Generation for Improved NGS Accuracy and Reproducibility”, referenced in this paragraph and incorporated by reference in its entirety.

REFERENCE TO A “SEQUENCE LISTING,” A TABLE, OR A COMPUTER PROGRAM LISTING APPENDIX SUBMITTED AS AN ASCII FILE

The Sequence Listing written in file 49927-502001US_ST25, created Mar. 20, 2017, 636 bytes, machine format IBM-PC, MS Windows operating system, is hereby incorporated by reference.

TECHNICAL FIELD

The subject matter described herein relates to genomics, and more particularly to dynamic genome reference generation for improved next generation sequencing (NGS) accuracy and reproducibility.

BACKGROUND

As DNA sequencing costs have plummeted over the last few years, raw data generated by sequencing has increased exponentially, measuring petabytes of data, making analyses and transfer of all this data difficult. These large amounts of data produce a critical bottleneck in the DNA sequencing workflow that has previously only been addressable by throwing increasing numbers of ever more powerful CPU cores at the problem. However, since the data being produced by sequencing already far outpaces Moore's Law, this solution has very limited sustainability.

The hugely parallel approach of NGS requires a human reference genome to be used to reconstruct the patient's genome from the raw read data. The human reference genome has become essential for clinical applications, and is used to identify alleles for risk, protection, or treatment-specific response in human disease. Yet, the current reference genome, GRCh38, being based on a limited number of samples, neither adequately represents the full range of human diversity, nor is complete. Further, the existing approach followed by the GRC and the genomics industry to construct a “static” reference genome introduces biases in standard bioinformatic pipelines used to detect the unique complement of variants in an individual's genome. An elegant, cost-effective bioinformatics pipeline solution to perform the analysis of the sequenced data rapidly, accurately and in a consistent, reproducible way based on a truly population-wide reference is the final frontier to commoditize sequencing.

SUMMARY

In one aspect, a Next Generation Sequencing (NGS) bioinformatics ASIC (Application Specific Integrated Circuit) is disclosed. A “dynamic” reference is introduced that utilizes population level information to improve reference-based alignment to detect novel, deleterious, or functional variants in clinical sequencing applications. To generate the dynamic reference, an automatically-updated database of known genetic variants (SNPs, indels, CNVs, etc., archived at e.g. dbSNP, DGVa, dbVar) is built and provided, and augments a standard reference genome with the variants, to be processed by the NGS ASIC.

Implementations of the current subject matter can include, but are not limited to, systems and methods including one or more of the features described herein, as well as articles that comprise a tangibly embodied machine-readable medium operable to cause one or more machines (e.g., computers, etc.) to result in operations described herein. Similarly, computer systems are also described that may include one or more processors and one or more memories coupled to the one or more processors. A memory, which can include a computer-readable storage medium, may include, encode, store, or the like one or more programs that cause one or more processors to perform one or more of the operations described herein. Computer implemented methods consistent with one or more implementations of the current subject matter can be implemented by one or more data processors residing in a single computing system or multiple computing systems. Such multiple computing systems can be connected and can exchange data and/or commands or other instructions or the like via one or more connections, including but not limited to a connection over a network (e.g. the Internet, a wireless wide area network, a local area network, a wide area network, a wired network, or the like), via a direct connection between one or more of the multiple computing systems, etc.

The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Other features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims. While certain features of the currently disclosed subject matter are described for illustrative purposes in relation to an enterprise resource software system or other business software solution or architecture, it should be readily understood that such features are not intended to be limiting. The claims that follow this disclosure are intended to define the scope of the protected subject matter.

DESCRIPTION OF DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of this specification, show certain aspects of the subject matter disclosed herein and, together with the description, help explain some of the principles associated with the disclosed implementations. In the drawings,

FIG. 1 shows a Wavefront Processor on a PCIe card for NGS sequencers;

FIG. 2 shows an architecture for mapping, aligning, sorting and de-duplication of genomic sequence data;

FIG. 3 illustrates a mapper/aligner variant calling pipeline comparison;

FIG. 4 illustrates static versus dynamic reference genomes. Sequence legend: accgattgca gtcaaagtcc tgtgtcacgt gtacttggcg cacaaacctg tg (SEQ ID NO:1).

FIG. 5 shows a dynamic reference generation pipeline; and

FIG. 6 illustrates size and performance differences between BWA and the Wavefront Processor with dynamic reference.

When practical, similar reference numbers denote similar structures, features, or elements.

DETAILED DESCRIPTION

To address these and potentially other issues with currently available solutions, methods, systems, articles of manufacture, and the like consistent with one or more implementations of the current subject matter can, among other possible advantages, provide a Next Generation Sequencing (NGS) bioinformatics ASIC (Application Specific Integrated Circuit). It enables the computational time required for the NGS data analysis pipeline to be radically reduced from many hours down to only a few minutes.

This dramatic speed improvement addresses the “static” reference issue in a way that has not been previously possible. A “dynamic” genome reference is provided that utilizes population level information to improve reference-based alignment to detect novel, deleterious, or functional variants in clinical sequencing applications. To generate the dynamic reference, an automatically updated database of known genetic variants (SNPs, indels (insertions/deletions), CNVs, etc., archived at e.g. dbSNP, DGVa, dbVar) is also provided, and augments a Wavefront Processor to utilize this data to enhance the standard reference genome with these variants.

The Wavefront Processor is shown in FIGS. 1 and 2. The Wavefront Processor enables the computational time required for the whole genome sequencing (WGS) data analysis pipeline to be radically reduced from many hours down to only a few minutes at unprecedented quality (FIG. 3). The Wavefront Processor includes a configurable hardware architecture (FPGA) to speed up read mapping, alignment, sorting, and duplicate marking. Additionally, a number of modular extensions are provided to address the need to account for human diversity.

The reference genome has been a guiding principle for the development of a vast array of computational tools and forms the foundation for databases and bioinformatics algorithms that are used to define target regions for re-sequencing, perform genome wide association studies, or measure inter-species conservation. The human reference genome has become essential for clinical applications, and is used to identify alleles for risk, protection, or treatment-specific response in human disease. Yet, the current reference genome, GRCh38, being based on a limited number of samples, neither adequately represents the full range of human diversity, nor is complete. Further, the existing approach followed by the GRC and the genomics industry to construct a “static” reference genome introduces biases in standard bioinformatic pipelines used to detect the unique complement of variants in an individual's genome, as shown in FIG. 4.

To address this problem, the “dynamic” reference is introduced (FIG. 4) that utilizes population level information to improve reference-based alignment to detect novel, deleterious, or functional variants in clinical sequencing applications. To generate the dynamic reference, an automatically updated database of known genetic variants (SNPs, indels, CNVs, etc., archived at e.g. dbSNP, DGVa, dbVar) is built, and augments a standard reference genome with the variants in a manner analogous to conventional methodologies, as shown in FIG. 5.
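By way of illustration only, the augmentation of a standard reference with known variants can be sketched in Python as follows. The sketch assumes that each variant is stored as a short alternate sequence consisting of the alternate allele padded with flanking bases copied from the primary sequence. The names Variant, build_alt_contig and build_dynamic_reference, the alternate-sequence naming scheme, and the default padding are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str  # name of the primary sequence the variant lies on
    pos: int    # 0-based start of the REF allele on the primary sequence
    ref: str    # reference allele
    alt: str    # alternate allele


def build_alt_contig(primary: str, v: Variant, pad: int) -> tuple[str, int]:
    """Return (alternate sequence, 0-based primary start of its left flank)."""
    if primary[v.pos:v.pos + len(v.ref)] != v.ref:
        raise ValueError("REF allele does not match the primary sequence")
    start = max(0, v.pos - pad)
    end = min(len(primary), v.pos + len(v.ref) + pad)
    # left flank + alternate allele + right flank
    alt_seq = primary[start:v.pos] + v.alt + primary[v.pos + len(v.ref):end]
    return alt_seq, start


def build_dynamic_reference(primaries: dict[str, str],
                            variants: list[Variant],
                            pad: int = 4) -> dict[str, str]:
    """Primary sequences plus one padded alternate sequence per known variant."""
    reference = dict(primaries)
    for v in variants:
        alt_seq, _ = build_alt_contig(primaries[v.chrom], v, pad)
        # Hypothetical naming scheme, chosen only for this illustration.
        reference[f"{v.chrom}_alt_{v.pos}_{v.ref}_{v.alt}"] = alt_seq
    return reference
```

In this sketch the primary sequences are left unchanged and each variant contributes one additional entry, so re-running the function against a refreshed variant list yields an updated reference.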

A Burrows-Wheeler transform (BWT) based aligner has been developed that maps reads to a dynamic reference genome. Importantly, alignment accuracy can be markedly improved in SNP-dense regions and regions with long indels that are often problematic for successful alignment of short reads. For BWT-based aligners (e.g. BWA), as the number of variants included in a dynamic reference increases, memory usage and run times increase. As illustrated in FIG. 6, augmentation can double the amount of memory required and cause the algorithm to take 100 times longer to run. In some implementations consistent with the current disclosure, the memory footprint can be increased approximately 30% with more variants, especially indels. However, for a hash-based algorithm, run times are expected to remain relatively constant. Furthermore, as more variants are identified and incorporated into the dynamic reference in the future, the alignment accuracy of the Wavefront Processor will continue to improve with little change in run times.
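The expectation of roughly constant run times for a hash-based approach can be illustrated with a minimal software sketch. A Python dictionary stands in for the hash table here; this is not the hardware implementation of the Wavefront Processor, and the function names build_seed_index and map_read, as well as the exact-match voting scheme, are assumptions made for illustration. Each seed lookup takes the same average time however many alternate sequences the index holds, so adding variants mainly enlarges the table.

```python
from collections import defaultdict


def build_seed_index(reference: dict[str, str],
                     k: int) -> dict[str, list[tuple[str, int]]]:
    """Hash every k-mer of every sequence (primary or alternate) to its positions."""
    index = defaultdict(list)
    for name, seq in reference.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return dict(index)


def map_read(read: str,
             index: dict[str, list[tuple[str, int]]],
             k: int) -> list[tuple[str, int]]:
    """Return the (sequence name, implied read start) pairs with the most seed hits."""
    votes = defaultdict(int)
    for j in range(len(read) - k + 1):
        # One average constant-time hash lookup per seed.
        for name, i in index.get(read[j:j + k], ()):
            votes[(name, i - j)] += 1
    if not votes:
        return []
    best = max(votes.values())
    return sorted(place for place, n in votes.items() if n == best)
```

For example, with a primary sequence "ACGTACGTTAGC" and an alternate sequence "ACGTGCGTTAGC" that carries a SNP, the read "GTGCGT" has no exact seed hit on the primary sequence at k = 4 but maps to offset 2 of the alternate sequence.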

To support a dynamic reference, reads aligning to alternate sequences that overlap primary sequences (chromosomes) in most cases must be re-aligned to the correct primary sequence with properly adjusted FLAG, RNAME, POS, MAPQ, and CIGAR SAM fields. The CIGAR strings for a read aligning to an indel alternate sequence must be translated for proper alignment with the corresponding primary sequence. If such a read maps to a rare indel sequence, its MAPQ value may be penalized to decrease the chance of a false positive variant call. This penalty can be determined empirically with ground truth variant call data.
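The translation described above can be sketched in Python under simplifying assumptions: the read aligns to the alternate sequence without gaps, the alternate sequence encodes either a pure insertion or a pure deletion, and the FLAG field is left unchanged. The IndelAlt record, the lift_to_primary function and the default penalty of 10 are hypothetical; as stated above, a real penalty would be determined empirically.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndelAlt:
    rname: str          # primary sequence name, e.g. "chr1"
    primary_start: int  # 0-based primary coordinate of alternate offset 0
    left_flank: int     # flank bases preceding the indel in the alternate
    ref_len: int        # primary bases deleted (0 for a pure insertion)
    alt_len: int        # bases inserted (0 for a pure deletion)
    rare: bool = False  # whether the indel is considered rare


def lift_to_primary(alt: IndelAlt, alt_pos: int, read_len: int, mapq: int,
                    rare_penalty: int = 10) -> Optional[tuple[str, int, str, int]]:
    """Translate a gap-free alignment on an alternate sequence into
    (RNAME, 1-based POS, CIGAR, MAPQ) on the primary sequence."""
    if alt.ref_len and alt.alt_len:
        raise ValueError("sketch handles pure insertions or pure deletions only")
    end = alt_pos + read_len
    a0, a1 = alt.left_flank, alt.left_flank + alt.alt_len  # inserted span
    m1 = max(0, min(end, a0) - alt_pos)           # read bases on the left flank
    ins = max(0, min(end, a1) - max(alt_pos, a0))  # read bases inside the insertion
    m2 = max(0, end - max(alt_pos, a1))            # read bases on the right flank
    if m1 == 0 and m2 == 0:
        return None  # read lies wholly inside inserted bases
    spans_deletion = bool(alt.ref_len and m1 and m2)
    ops = [(m1, "M"), (ins, "I")]
    if spans_deletion:
        ops.append((alt.ref_len, "D"))
    ops.append((m2, "M"))
    ops = [(n, op) for n, op in ops if n]
    # An insertion at either end of the read is reported as a soft clip.
    if ops[0][1] == "I":
        ops[0] = (ops[0][0], "S")
    if ops[-1][1] == "I":
        ops[-1] = (ops[-1][0], "S")
    if alt_pos < a0:
        pos0 = alt.primary_start + alt_pos
    elif alt_pos < a1:
        pos0 = alt.primary_start + a0
    else:
        pos0 = alt.primary_start + alt_pos - alt.alt_len + alt.ref_len
    if alt.rare and (ins > 0 or spans_deletion):
        mapq = max(0, mapq - rare_penalty)  # assumed penalty, not from the disclosure
    cigar = "".join(f"{n}{op}" for n, op in ops)
    return alt.rname, pos0 + 1, cigar, mapq
```

For instance, a 12-base read starting at offset 5 of an alternate sequence that carries a rare 3-base insertion after a 10-base left flank is reported on the primary sequence with the CIGAR string 5M3I4M and a reduced MAPQ.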

Aligning reads longer than 1000 bases is impractical from a memory standpoint if indel alternate sequences are padded with sufficient bases to ensure alignment of full reads. For aligning very long reads, a compact representation of indel alternate sequences (without base padding) will be developed. In essence, each indel sequence must be stitched across the primary sequence region that it overlaps so that bases flanking the indel are coded just once in the dynamic reference.
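The disclosure does not specify the compact representation. Purely as a hypothetical illustration, the Python sketch below stores only the bases unique to the alternate allele together with the two primary coordinates that the indel joins, and rebuilds flanking bases from the primary sequence on demand. The StitchedIndel record, the function names, and the assumption that a padded alternate sequence needs read_len - 1 flank bases on each side are not taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StitchedIndel:
    left_end: int     # primary index just past the last base before the indel
    right_start: int  # primary index of the first base after the indel
    inserted: str     # bases unique to the alternate allele ("" for a deletion)


def padded_bases(indel: StitchedIndel, read_len: int) -> int:
    """Bases stored for a padded alternate sequence, assuming read_len - 1
    flank bases on each side so that any overlapping read fits entirely."""
    return 2 * (read_len - 1) + len(indel.inserted)


def stitched_bases(indel: StitchedIndel) -> int:
    """Bases stored when flanks are shared with the primary sequence."""
    return len(indel.inserted)


def walk(primary: str, indel: StitchedIndel, flank: int) -> str:
    """Rebuild the alternate haplotype around the indel from the primary sequence."""
    left = primary[max(0, indel.left_end - flank):indel.left_end]
    right = primary[indel.right_start:indel.right_start + flank]
    return left + indel.inserted + right
```

Under these assumptions, a 2-base insertion costs 2,000 stored bases when padded for 1000-base reads but only 2 bases when stitched, because the flanking bases are coded once in the primary sequence.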

Together with already-proven Mapping/Aligning/Sorting technology, as described in U.S. patent application Ser. No. 14/158,758, filed Jan. 17, 2014, entitled BIOINFORMATICS SYSTEMS, APPARATUSES, AND METHODS EXECUTED ON AN INTEGRATED CIRCUIT PROCESSING PLATFORM, the contents of which are incorporated by reference herein for all purposes, the dynamic reference genome and extensions to the Pipeline Processor that are described herein have a large impact on the quality of analysis results that can be achieved. Not only are rate and quality of variant identification increased from sequence data that is generated by a variety of next generation sequencing technologies, but accuracy of interpretive analysis of variant data is improved to provide novel e-diagnostics for the future, and deeper understanding of disease and its application in a clinical context.

One or more aspects or features of the subject matter described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), computer hardware, firmware, software, and/or combinations thereof. These various aspects or features can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. The programmable or hardwired system or computing system may interface with client computers and server computers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

These computer programs, which can also be referred to as programs, software, software applications, applications, components, or code, include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the term “machine-readable medium” refers to any computer program product, apparatus and/or device, such as for example magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs), and gate arrays, used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor. The machine-readable medium can store such machine instructions non-transitorily, such as for example as would a non-transient solid-state memory or a magnetic hard drive or any equivalent storage medium. The machine-readable medium can alternatively or additionally store such machine instructions in a transient manner, such as for example as would a processor cache or other random access memory associated with one or more physical processor cores.

To provide for interaction with a user, one or more aspects or features of the subject matter described herein can be implemented on a computer having a display device, such as for example a cathode ray tube (CRT), a liquid crystal display (LCD) or a light emitting diode (LED) monitor for displaying information to the user and a keyboard and a pointing device, such as for example a mouse or a trackball, by which the user may provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, such as for example visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including, but not limited to, acoustic, speech, or tactile input. Other possible input devices include, but are not limited to, touch screens or other touch-sensitive devices such as single or multi-point resistive or capacitive trackpads, voice recognition hardware and software, optical scanners, optical pointers, digital image capture devices and associated interpretation software, and the like.

The subject matter described herein can be embodied in systems, apparatus, methods, and/or articles depending on the desired configuration. The implementations set forth in the foregoing description do not represent all implementations consistent with the subject matter described herein. Instead, they are merely some examples consistent with aspects related to the described subject matter. Although a few variations have been described in detail above, other modifications or additions are possible. In particular, further features and/or variations can be provided in addition to those set forth herein. For example, the implementations described above can be directed to various combinations and subcombinations of the disclosed features and/or combinations and subcombinations of several further features disclosed above. In addition, the logic flows depicted in the accompanying figures and/or described herein do not necessarily require the particular order shown, or sequential order, to achieve desirable results. Other implementations may be within the scope of the following claims.

What is claimed is:
 1. A system for executing a sequence analysis pipeline on genetic sequence data from an electronic data source that provides digital signals representing a plurality of reads of genomic data, each of the plurality of reads of genomic data comprising a sequence of nucleotides, the system comprising: a first memory storing one or more genetic reference sequences and an index of the one or more genetic reference sequences, the index of the one or more genetic reference sequences further comprising a hash table; a second memory storing one or more reference variants; and an integrated circuit formed of a set of hardwired digital logic circuits that are interconnected by a plurality of physical electrical interconnects, one or more of the plurality of physical electrical interconnects comprising an input to the integrated circuit connected with the electronic data source for receiving the plurality of reads of genomic data, one or more of the plurality of physical electrical interconnects further comprising a memory interface for the integrated circuit to access the memory, the hardwired digital logic circuits being arranged as a set of processing engines, each processing engine being formed of a subset of the hardwired digital logic circuits to perform one or more steps in the sequence analysis pipeline on the plurality of reads of genomic data, each subset of the hardwired digital logic circuits being in a wired configuration to perform the one or more steps in the sequence analysis pipeline, the set of processing engines comprising: a mapping module in the wired configuration to access, according to at least some of the sequence of nucleotides in a read of the plurality of reads, the index of the one or more genetic reference sequences from the memory via the memory interface to map the read to one or more segments of the one or more genetic reference sequences or reference variants based on the index, and to apply a hash function to the at least some of the sequence of nucleotides to access the hash table of the index; and an alignment module in the wired configuration to access the one or more genetic reference sequences from the memory via the memory interface to align the read to one or more positions in the one or more segments of the one or more genetic reference sequences from the mapping module; and one or more of the plurality of physical electrical interconnects comprising an output from the integrated circuit for communicating result data from the mapping module and/or the alignment module.
 2. The system in accordance with claim 1, wherein the integrated circuit further comprises a master controller to establish the wired configuration for each subset of the hardwired digital logic circuits to perform the one or more steps in the sequence analysis pipeline.
 3. The system in accordance with claim 1, wherein the integrated circuit comprises a field programmable gate array (FPGA) of the hardwired digital logic circuits.
 4. The system in accordance with claim 1, wherein the wired configuration is established upon manufacture of the integrated circuit and is non-volatile.
 5. A system for executing a sequence analysis pipeline on genetic sequence data from an electronic data source that provides digital signals representing a plurality of reads of genomic data, each of the plurality of reads of genomic data comprising a sequence of nucleotides, the system comprising: a first memory storing one or more genetic reference sequences and an index of the one or more genetic reference sequences, the index of the one or more genetic reference sequences further comprising a hash table; a second memory storing one or more reference variants; and an integrated circuit formed of a set of hardwired digital logic circuits that are interconnected by a plurality of physical electrical interconnects, one or more of the plurality of physical electrical interconnects comprising an input to the integrated circuit connected with the electronic data source for receiving the plurality of reads of genomic data, one or more of the plurality of physical electrical interconnects further comprising a memory interface for the integrated circuit to access the memory, the hardwired digital logic circuits being arranged as a set of processing engines, each processing engine being formed of a subset of the hardwired digital logic circuits to perform one or more steps in the sequence analysis pipeline on the plurality of reads of genomic data, the one or more steps in the sequence analysis pipeline comprising: accessing, according to at least some of the sequence of nucleotides in a read of the plurality of reads, the index of the one or more genetic reference sequences from the memory via the memory interface; applying a hash function to the at least some of the sequence of nucleotides to access the hash table of the index; and mapping the read to one or more segments of the one or more genetic reference sequences or reference variants based on the index.
 6. The system in accordance with claim 5, wherein the one or more steps in the sequence analysis pipeline further comprise: accessing the one or more genetic reference sequences from the memory via the memory interface; and aligning the read to one or more positions in the one or more segments of the one or more genetic reference sequences from the mapping to generate result data.
 7. The system in accordance with claim 6, wherein the one or more steps in the sequence analysis pipeline further comprise: outputting the result data.
 8. The system in accordance with claim 5, wherein the integrated circuit further comprises a master controller to establish the wired configuration for each subset of the hardwired digital logic circuits to perform the one or more steps in the sequence analysis pipeline.
 9. The system in accordance with claim 5, wherein the integrated circuit comprises a field programmable gate array (FPGA) of the hardwired digital logic circuits.
 10. The system in accordance with claim 5, wherein the wired configuration is established upon manufacture of the integrated circuit and is non-volatile.